Improve tokenizer guidance for local ONNX models #10338
Instructions for tokenizer DLL requirements
@Notheisz57 : Thanks for your contribution! The author(s) and reviewer(s) have been notified to review your proposed change.

Learn Build status updates of commit f8993f4: ✅ Validation status: passed
For more details, please refer to the build report.

Can you review the proposed changes? IMPORTANT: When the changes are ready for publication, adding a #label:"aq-pr-triaged"
Pull request overview
This PR updates the CREATE EXTERNAL MODEL documentation to clarify how to build and expose a tokenizer DLL (tokenizers-cpp) used by SQL Server when running local ONNX embedding generation (for example, via AI_GENERATE_EMBEDDINGS).
Changes:
- Adds a C++ example showing an expected exported entry point for a tokenizer DLL.
- Adds a note about the DLL export signature potentially changing, and reiterates the expected DLL filename.
The tokenizer must be compiled as a shared dynamic link library using MSVC, and must export a specific entry point:

```cpp
#include "tokenizers_cpp.h" // for example: `tokenizers-cpp\include\tokenizers_cpp.h`
#include <string>
#include <vector>

extern "C" __declspec(dllexport)
void LoadBlobJsonAndEncode(
    const std::string& json_blob, // contents of `tokenizer.json`
    const std::string& text, // input text to tokenize
    std::vector<int>& out_ids // output token IDs (fed to the embedding model)
```
> [!NOTE]
> Ensure the created dll is named **tokenizers_cpp.dll**
> The exact signature of this export may change. Ensure the created dll is named **tokenizers_cpp.dll**
Clarify the `tokenizers-cpp` export, called by SQL Server in response to a SQL query involving `AI_GENERATE_EMBEDDINGS`. Intermediate build instructions are reasonably out of scope, but this specific export was necessary for SQL Server 2025 to successfully tokenize.
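
For readers who want to see the shape of such an export end to end, here is a minimal sketch, not the PR's actual implementation. It reuses the `LoadBlobJsonAndEncode` name and parameter list quoted in the diff. The diff hunk is cut off before the end of the signature, so the closing parenthesis, the `void` body, and everything inside it are assumptions. `StandInEncode`, `TOKENIZER_EXPORT`, and `EncodeFn` are hypothetical names introduced only for illustration. The stand-in splits on whitespace so the sketch compiles without the library. In a real build it would presumably be replaced by a call into tokenizers-cpp (as far as I recall, `tokenizers::Tokenizer::FromBlobJSON(json_blob)` followed by `Encode(text)`, which should be checked against the header you build against).

```cpp
#include <sstream>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <vector>

// Hypothetical macro: __declspec(dllexport) only exists on MSVC-style
// toolchains, so expand to nothing elsewhere to keep the sketch compilable.
#if defined(_WIN32)
#define TOKENIZER_EXPORT __declspec(dllexport)
#else
#define TOKENIZER_EXPORT
#endif

namespace {

// HYPOTHETICAL stand-in for the real tokenizer. It splits on whitespace and
// numbers each distinct word in order of first appearance. A real DLL would
// build a tokenizer from the JSON blob instead of using this function.
std::vector<int> StandInEncode(const std::string& text) {
    std::unordered_map<std::string, int> ids_by_word;
    std::vector<int> ids;
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        // emplace keeps the existing ID if the word was already seen.
        auto inserted = ids_by_word.emplace(
            word, static_cast<int>(ids_by_word.size()));
        ids.push_back(inserted.first->second);
    }
    return ids;
}

}  // namespace

// Entry point with the parameter list quoted in the diff. The body is an
// assumption: it clears the output, refuses an empty blob, then encodes.
extern "C" TOKENIZER_EXPORT
void LoadBlobJsonAndEncode(
    const std::string& json_blob,  // contents of tokenizer.json
    const std::string& text,       // input text to tokenize
    std::vector<int>& out_ids) {   // output token IDs
    out_ids.clear();
    if (json_blob.empty()) {
        return;  // nothing to build a tokenizer from
    }
    out_ids = StandInEncode(text);  // stand-in only; json_blob is not parsed
}

// Because the note warns that the signature may change, a compile-time check
// like this one makes any drift between the expected and actual types visible.
extern "C" typedef void (*EncodeFn)(
    const std::string&, const std::string&, std::vector<int>&);
static_assert(
    std::is_same<decltype(&LoadBlobJsonAndEncode), EncodeFn>::value,
    "LoadBlobJsonAndEncode no longer matches the expected signature");
```

With the stand-in, calling the export with a non-empty blob and the text `the cat the` fills `out_ids` with one ID per word, repeating the first ID for the repeated word. Passing `std::string` and `std::vector` across an `extern "C"` boundary only works when the caller and the DLL share a compatible C++ runtime, which is presumably why the documentation insists on MSVC.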